Probability, Inference, and Genotype Likelihoods
Eric C. Anderson
Conservation Genomics Workshop, Monday August 28, 2023
We will play around with a Shiny App to cement all three of these ideas (small-group work).
It can be interesting to think of probability as a unit of measure for uncertainty.
Nonetheless, E.T. Jaynes provides an interesting and accessible discussion of how the system of probability, as it has already been developed, does indeed arise as the unique system that satisfies a remarkably small number of desirable properties for measuring uncertainty.
\[ P(\mathrm{Individual~is}~AA|p_A) = p_A^2 \]
Which we get because the allelic types, \(Y_1\) and \(Y_2\), of the two gene copies are independent of one another, given \(p_A\):
\[ \begin{aligned} P(\mathrm{Individual~is}~AA\;|\;p_A) &= P(Y_1=A\;|\;p_A) P(Y_2=A\;|\;p_A) \\ &= p_A p_A \\ &= p_A^2 \end{aligned} \]
In most of science we are interested in inference
I think of this as probability “running backward”
In a DAG the arrows run from unobserved nodes to observed nodes
In the DAG on the bottom we have data being genotypes from \(N\) individuals: \[ \boldsymbol{Y} = (Y_{1,1}, Y_{1,2}, Y_{2,1}, \ldots, Y_{N,1}, Y_{N,2}) \]
\[ P(\boldsymbol{Y}~|~p_A) = \prod_{i=1}^N P(Y_{i,1}~|~p_A)P(Y_{i,2}~|~p_A) \] In fact, if we define each \(Y\) to be 0 or 1 as follows: \[ \begin{align} Y = 0 & ~~~~~\mbox{with probability} ~~~ 1 - p_A & ~~\mbox{(i.e., it's an}~a) \\ Y = 1 & ~~~~~\mbox{with probability} ~~~ p_A & ~~\mbox{(i.e., it's an}~A) \end{align} \] Then it is not too hard to see that \[ P(\boldsymbol{Y}~|~p_A) = p_A^{\sum Y_{i,j}} (1- p_A)^{2N - \sum Y_{i,j}} \] This is a probability function. But if you consider this as a function of \(p_A\) with \(\boldsymbol{Y}\) considered as fixed, then it is often referred to as the likelihood function.
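As a quick sketch of the likelihood function above, here is how it could be evaluated in R for a small made-up sample of 0/1-coded gene copies (the data vector and sample size are invented for illustration):

```r
# Hypothetical sample of 2N = 10 gene copies from N = 5 diploids,
# coded 1 = A and 0 = a (these values are made up for illustration).
Y <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 1)
n_copies <- length(Y)  # 2N
n_A <- sum(Y)          # number of A alleles observed

# The likelihood: p_A^sum(Y) * (1 - p_A)^(2N - sum(Y))
likelihood <- function(p_A) p_A^n_A * (1 - p_A)^(n_copies - n_A)

# The sample allele frequency maximizes the likelihood:
p_hat <- n_A / n_copies               # 0.6
likelihood(p_hat) >= likelihood(0.5)  # TRUE
```

Evaluating `likelihood()` over a grid of values between 0 and 1 (with `Y` held fixed) traces out the likelihood curve for \(p_A\).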
These all give you a point estimate
As an alternative to those, the Bayesian approach is to calculate the posterior probability of \(p_A\) given the data.
This simply uses the laws of probability to express our uncertainty about \(p_A\) given the information in the sample.
\(P(A)\) and \(P(B)\) are referred to as marginal probabilities.
The joint probability of \(A\) and \(B\) is the probability that both of those outcomes occur.
Joint prob = marginal probability times a conditional probability: \[ P(A,B) = P(A)P(B~|~A) = P(B)P(A~|~B) \]
So, conditional probabilities can be computed from the joint probability: \[ P(A~|~B) = \frac{P(A,B)}{P(B)}~~~~~~~~~~~~~~~~~~~~~~~~ P(B~|~A) = \frac{P(A,B)}{P(A)} \]
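A tiny numeric check of those identities, with made-up probabilities (none of these numbers mean anything beyond the arithmetic):

```r
# Made-up marginal and conditional probabilities for illustration.
p_A         <- 0.4   # marginal P(A)
p_B_given_A <- 0.5   # conditional P(B|A)
p_B         <- 0.25  # marginal P(B) (also assumed)

# Joint from a marginal times a conditional:
p_AB <- p_A * p_B_given_A   # P(A,B) = 0.2

# Conditional recovered from the joint:
p_A_given_B <- p_AB / p_B   # P(A|B) = 0.8
```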
Thus, a much simpler expression of Bayes' Theorem: \[
P(A~|~B) \propto P(A, B)
\] with \(P(A|B)\) as a function of \(A\) with \(B\) being the fixed “data.”
The joint probability is typically easy to compute; the normalizing constant is then recovered by requiring that \(P(A|B)\) sum to one over the values of \(A\).
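That normalization step is easy to see with a toy discrete example (the three values of \(A\) and their joint probabilities below are made up):

```r
# Made-up joint probabilities P(A = a, B) for three values of A,
# with B fixed at its observed value.
joint <- c(a1 = 0.10, a2 = 0.05, a3 = 0.05)

# Normalize so the posterior sums to one over the values of A:
posterior <- joint / sum(joint)
posterior  # a1 = 0.50, a2 = 0.25, a3 = 0.25
```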
Since \(p_A\) is a proportion, an obvious choice for prior would be a beta distribution. The beta distribution gives a continuous prior distribution on a value that is between 0 and 1. It has two parameters, often called \(\alpha_1\) and \(\alpha_2\). Here are some examples:
The beta density for a random variable \(X\) has the form: \[ p(x | \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} x^{\alpha_1 - 1}(1-x)^{\alpha_2 - 1} \] The part that looks hairy is a few Gamma functions. Don’t worry about those—it is a constant. The important part (the “kernel”, as they say…) is: \[ x^{\alpha_1 - 1}(1-x)^{\alpha_2 - 1} \] Or, if we wanted this to a be a prior on \(p_A\), the prior would be proportional to: \[ {p_A}^{\alpha_1 - 1}(1-p_A)^{\alpha_2 - 1} \] And, if we wanted to be even more specific, we could choose \(\alpha_1 = \alpha_2 = 1\) to give ourselves a uniform prior which is proportional to 1: \[ P(p_A) \propto {p_A}^{1 - 1}(1-p_A)^{1 - 1} = {p_A}^{0}(1-p_A)^{0} = 1 \]
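R's built-in `dbeta()` evaluates this density, so we can quickly verify that the \(\alpha_1 = \alpha_2 = 1\) case really is flat (the grid of \(x\) values below is arbitrary):

```r
# Evaluate the beta density on a grid strictly inside (0, 1).
x <- seq(0.01, 0.99, by = 0.01)

# Beta(1, 1) is the uniform distribution: density is constant at 1.
all(abs(dbeta(x, shape1 = 1, shape2 = 1) - 1) < 1e-12)  # TRUE

# Other shapes for comparison (parameter choices are arbitrary examples):
# dbeta(x, 0.5, 0.5) is U-shaped; dbeta(x, 5, 2) piles up near 1.
```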
We can’t run shiny on the ConGen RStudio Server
But you can run it on your own computer with RStudio. The procedure for that is:
if(!("usethis" %in% rownames(installed.packages()))) {
install.packages("usethis")
}
usethis::use_course("eriqande/ngs-genotype-models")

Then open the file 001-allele-freq-estimation.Rmd.

If that does not work for some of the students: you can go directly to the website: https://eriqande.shinyapps.io/001-allele-freq-estimation/. But, be warned that if too many people use that link it will overwhelm my free ShinyApps account.
Take a few minutes to introduce yourselves to one another, then play with the Shiny app and talk to one another about it as you do. Maybe even work together to do these things:
I’ll try to say hello in the different breakout rooms.
The first 10 values of that sample look like:
[1] 0.3535304 0.3765251 0.3717175 0.4140024 0.3982999 0.3554734 0.3935456
[8] 0.3435761 0.3470759 0.3772759
We call this a “vanilla” Monte Carlo sample because every member of the sample was independent—this is not Markov chain Monte Carlo.
If we wanted to use the sample to approximate the full posterior, we could do that with a histogram:
If we want the posterior mean, that is easy, too:
as is the posterior median:
Or the standard deviation of the posterior distribution:
Or, the 90%-equal-tail Credible Interval
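The summaries above can all be computed with one-liners on the Monte Carlo sample. Here is a sketch, assuming (purely for illustration) a beta-shaped posterior; the shape parameters below are made up and are not the ones from the workshop data:

```r
# Draw a vanilla Monte Carlo sample from an assumed Beta posterior.
# (shape1 and shape2 here are illustrative, not from the real data.)
set.seed(5)
mc <- rbeta(1e5, shape1 = 30, shape2 = 50)

hist(mc, breaks = 50)        # approximate the full posterior
mean(mc)                     # posterior mean
median(mc)                   # posterior median
sd(mc)                       # posterior standard deviation
quantile(mc, c(0.05, 0.95))  # 90% equal-tail credible interval
```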
All of those quantities could have been obtained analytically in this case, but it is a lot simpler to just work with a Monte Carlo sample because the operations are the same with every Monte Carlo sample (which is not necessarily true of every analytical distribution…).
\(i\) is homozygous for \(C\) (i.e., \(Y_{i,1} = Y_{i,2} = C\)):
\(i\) is homozygous for \(T\) (i.e., \(Y_{i,1} = Y_{i,2} = T\)):
\(i\) is heterozygous (i.e., \(Y_{i,1} = C, Y_{i,2} = T\)):
The genotype likelihoods can be calculated from the read data alone.
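For a concrete sense of how that works, here is a sketch under a simple model in which each read is drawn from one of the individual's two gene copies and is miscalled with probability \(\epsilon\). The error rate and read counts below are made-up illustrative values, not output from the app:

```r
# Assumed per-read sequencing error rate and observed read counts
# (all illustrative numbers).
eps <- 0.01  # probability a read reports the wrong allele
n_C <- 3     # reads carrying the C allele
n_T <- 1     # reads carrying the T allele

# P(reads | genotype), assuming reads are independent given the genotype:
lik_CC <- (1 - eps)^n_C * eps^n_T        # a T read from CC requires an error
lik_CT <- 0.5^(n_C + n_T)                # each read is C or T with prob 1/2
lik_TT <- eps^n_C * (1 - eps)^n_T        # every C read requires an error
c(CC = lik_CC, CT = lik_CT, TT = lik_TT)
```

Note that with only 3 C reads and 1 T read, the heterozygote has the highest likelihood here: the single T read is costly for CC because it must be explained as a sequencing error.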
But, if we combine that with allele frequencies in our model:
…then we can also compute the posterior probability (by MCMC sampling) of each genotype. If you are sampling from a single population, then this provides a much better estimate of the true genotype.
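As a minimal sketch of that combination, suppose (for illustration only) that the C allele frequency is known to be 0.8 and we have the genotype likelihoods below; the app itself samples the allele frequency by MCMC rather than treating it as known:

```r
# Assumed population allele frequency and illustrative genotype
# likelihoods (e.g. as computed from read data); all numbers made up.
p_C   <- 0.8
prior <- c(CC = p_C^2, CT = 2 * p_C * (1 - p_C), TT = (1 - p_C)^2)
lik   <- c(CC = 0.0097, CT = 0.0625, TT = 1e-6)

# Bayes: posterior proportional to prior times likelihood, normalized.
post <- prior * lik / sum(prior * lik)
round(post, 3)
```

Notice how the Hardy-Weinberg prior shifts weight toward genotypes that are common at this allele frequency, which is why population information sharpens genotype calls from low-coverage data.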
From the ngs-genotype-models RStudio project that you downloaded previously:
002-genotype-likelihoods-from-reads.Rmd

If this doesn't work for you, then you can access the Shiny app over the web: https://eriqande.shinyapps.io/002-genotype-likelihoods-from-reads/
There are more questions for thought at the bottom of the Shiny App Notebook, too.
You will often hear that using genotype likelihoods allows you to “propagate uncertainty” to downstream analyses.
This is good practice—it can help you to not underestimate your uncertainty.
Our third exercise with a ShinyApp addresses this question, revealing some of the unfortunate things that can happen if you call genotypes from low coverage sequencing data and you treat them as known/certain.
From the ngs-genotype-models RStudio project that you downloaded previously:
003-read-inference-gsi.Rmd

If this doesn't work for you, then you can access the Shiny app over the web: https://eriqande.shinyapps.io/003-read-inference-gsi/